By this point we already have the Ones, Z, Chi, and oxidation-number fingerprints calculated. These are stored, along with the chemical formulae, in the file FingerPrint_lt50.csv in the data folder. Note that we only consider compounds with fewer than 50 atoms in the unit cell and for which the oxidation numbers can be calculated.
In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [5]:
Df=pd.read_csv("../data/FingerPrint_lt50.csv",sep='\t',index_col=0)
In [6]:
Df.head()
Out[6]:
We load up all the formulas and the fingerprints and check that they have the correct shapes
In [7]:
Formulas=Df["Formula"].values
In [8]:
Fingerprint=Df.drop("Formula",axis=1).values
In [6]:
Fingerprint.shape
Out[6]:
In [7]:
Formulas.shape
Out[7]:
We perform a 100-component PCA on the fingerprints and then check how many components we need to keep in order to retain most of the variance. The plot below shows the cumulative explained variance ratio as a function of the number of components kept.
In [9]:
from sklearn.decomposition import PCA
In [10]:
pca=PCA(n_components=100)
In [11]:
pca_fing=pca.fit_transform(Fingerprint)
In [13]:
plt.plot(np.arange(1,101),np.cumsum(pca.explained_variance_ratio_))
plt.ylabel("Explained Variance Ratio")
plt.xlabel("Number of Components")
Out[13]:
We list the cumulative explained variance ratio for each number of components below.
In [12]:
list(enumerate(np.cumsum(pca.explained_variance_ratio_)))
Out[12]:
With 50 components we might already be pushing the limits for the DBSCAN clustering we want to try, so let's stick to 50 and ramp it up if necessary. This captures almost 96% of the variance.
In [13]:
pca_fing50=pca_fing[:,0:50]
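As a quick check on the 96% figure quoted above, a minimal sketch reusing the pca object fitted earlier:
In [ ]:
# Cumulative explained variance ratio of the first 50 components.
print(np.cumsum(pca.explained_variance_ratio_)[49])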
In [14]:
from sklearn.cluster import KMeans
In [15]:
Km=KMeans(n_clusters=15,random_state=42,n_init=50)
In [16]:
clust_km50=Km.fit_predict(pca_fing50)
We output the number of elements in each cluster
In [17]:
from collections import Counter
print(Counter(clust_km50))
In [147]:
from sklearn.metrics.pairwise import euclidean_distances
We order the clusters by the distance of their centres from the centre of cluster 0, relabel them accordingly, and then use the Pandas sorting routines to sort the fingerprints.
In [19]:
dist_center=euclidean_distances(Km.cluster_centers_,Km.cluster_centers_[0:1])  # keep Y 2D
dist_sorted=sorted(dist_center)
sort_key=[]
for i in range(15):
    sort_key.append(np.where(dist_center==dist_sorted[i])[0][0])
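An equivalent, more compact way to obtain the same ordering (a sketch; it assumes the centre-to-centre distances are all distinct) is a single argsort:
In [ ]:
# Order clusters by the distance of their centres from the centre of cluster 0.
sort_key = list(np.argsort(dist_center.ravel()))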
In [20]:
clust_km50_s=np.zeros(len(clust_km50))
for i in range(15):
    clust_km50_s[clust_km50==sort_key[i]]=int(i)
In [21]:
Counter(clust_km50_s)
Out[21]:
In [22]:
Df["clust_pca50"]=clust_km50_s
In [23]:
Df_sorted=Df.sort_values(by="clust_pca50")
In [24]:
Formula_s=Df_sorted["Formula"]
In [25]:
Finger_s=Df_sorted.drop(["Formula","clust_pca50"],axis=1).values
In [26]:
Finger_s.shape
Out[26]:
We now perform PCA on this newly sorted set of fingerprints. We could have simply sorted the earlier PCA fingerprints instead; however, PCA is cheap and the PCA fingerprints were not in the Pandas dataframe.
In [27]:
pca2=PCA(n_components=50)
fing_pca50_s=pca2.fit_transform(Finger_s)
print(np.sum(pca2.explained_variance_ratio_))
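The alternative mentioned above would simply reorder the 50-component PCA fingerprints we already have. A minimal sketch (the name fing_pca50_alt is hypothetical; a stable sort keeps the within-cluster order):
In [ ]:
# Sketch: reorder the existing PCA fingerprints by the remapped cluster label
# instead of refitting PCA on the sorted raw fingerprints.
order = np.argsort(clust_km50_s, kind="stable")
fing_pca50_alt = pca_fing50[order]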
We now calculate Euclidean distances and plot the similarity matrix.
In [28]:
dist_fing_s=euclidean_distances(fing_pca50_s)
In [29]:
np.amax(dist_fing_s)
Out[29]:
In [31]:
clust_km50_s=Df_sorted["clust_pca50"].values
#fing_714=Finger_s[clust_km50_s>6]
#dist_fing_714=euclidean_distances(fing_714)
plt.figure(figsize=(10,10))
plt.imshow(dist_fing_s[::2,::2],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting")
plt.colorbar()
Out[31]:
In [32]:
fing_13=fing_pca50_s[clust_km50_s==10]
plt.figure(figsize=(8,8))
plt.imshow(euclidean_distances(fing_13),cmap=plt.cm.viridis)
plt.colorbar()
Out[32]:
I searched over a wide range of eps and min_samples values but never got very useful clusters: either the number of unclassified points was too high or the clustering was terrible. We should probably do a proper grid search, but the validation score isn't well defined for unsupervised clustering. Most of the cells have been deleted from this section.
Thought for later: maybe use adjusted_rand_score against the KMeans clusters as a validation metric.
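A minimal sketch of that idea, scoring each (eps, min_samples) pair against the KMeans labels with adjusted_rand_score; the grid values here are illustrative, not tuned:
In [ ]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

# Score each parameter pair by agreement with the KMeans labels.
best_params, best_score = None, -1.0
for eps in np.linspace(0.5, 5.0, 10):
    for min_samples in (5, 10, 20):
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric='precomputed').fit_predict(dist_fing_s)
        score = adjusted_rand_score(clust_km50_s, labels)
        if score > best_score:
            best_params, best_score = (eps, min_samples), score
print(best_params, best_score)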
In [71]:
from sklearn.cluster import DBSCAN
In [89]:
Db=DBSCAN(eps=1.5,min_samples=5,metric='precomputed')
In [90]:
for eps in np.linspace(0.5,12.0,25):
    Db.eps=eps
    clust_db=Db.fit_predict(dist_fing_s)
    print(eps, Counter(clust_db)[-1], np.amax(clust_db))
In [161]:
from sklearn.cluster import AgglomerativeClustering
In [162]:
Ag=AgglomerativeClustering(n_clusters=5)
In [ ]:
clust_ag=Ag.fit_predict(fing_pca50_s[:,0:10])
In [33]:
from sklearn.metrics.pairwise import cosine_distances
In [35]:
clust_km50_s=Df_sorted["clust_pca50"].values
dist_cosine_s=cosine_distances(fing_pca50_s)
plt.figure(figsize=(10,10))
plt.imshow(dist_cosine_s[::3,::3],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting after using cosine distances")
plt.colorbar()
Out[35]:
Note that in the plot above we show only every third row and column, because imshow otherwise took too long.
In [36]:
Km2=KMeans(n_clusters=20, random_state=42,n_init=50)
In [37]:
clust_km50_20=Km2.fit_predict(pca_fing50)
In [38]:
from sklearn.metrics import confusion_matrix
In [39]:
Counter(clust_km50_20)
Out[39]:
Let's reorder by this 20-cluster KMeans.
In [40]:
dist_center2=euclidean_distances(Km2.cluster_centers_,Km2.cluster_centers_[0:1])  # keep Y 2D
dist_sorted2=sorted(dist_center2)
sort_key2=[]
for i in range(20):
    sort_key2.append(np.where(dist_center2==dist_sorted2[i])[0][0])
In [41]:
print(sort_key2)
In [42]:
clust_km50_s2=np.zeros(len(clust_km50_20),dtype=int)
for i in range(20):
    clust_km50_s2[clust_km50_20==sort_key2[i]]=int(i)
In [43]:
Counter(clust_km50_s2)
Out[43]:
In [44]:
Df["clust_pca50_20"]=clust_km50_s2
In [45]:
Df_sorted2=Df.sort_values(by="clust_pca50_20")
In [46]:
Df_sorted2.head()
Out[46]:
In [47]:
Formula_s2=Df_sorted2["Formula"]
Finger_s2=Df_sorted2.drop(["Formula","clust_pca50","clust_pca50_20"],axis=1).values
In [97]:
Finger_s2.shape
Out[97]:
In [49]:
pca3=PCA(n_components=50)
finger_pca_s2=pca3.fit_transform(Finger_s2)
In [50]:
dist_fing_s2=euclidean_distances(finger_pca_s2)
In [51]:
plt.figure(figsize=(10,10))
plt.imshow(dist_fing_s2[::3,::3],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting")
plt.colorbar()
Out[51]:
In [52]:
dist_cosine_s2=cosine_distances(finger_pca_s2)
In [148]:
plt.figure(figsize=(10,10))
plt.imshow(dist_cosine_s2[::3,::3],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting using cosine distances")
plt.colorbar()
Out[148]:
In [55]:
from sklearn.cluster import DBSCAN
Db_try=DBSCAN(eps=0.1,min_samples=5,metric='precomputed')
In [104]:
Db_try.eps=0.12
Db_try.min_samples=9
Db_cluster_cos=Db_try.fit_predict(dist_cosine_s2)
print(Counter(Db_cluster_cos))
In [75]:
Counter(Db_cluster_cos)
Out[75]:
We label the first 10 elements of each of the 15 KMeans clusters with the cluster number and see how label propagation compares with KMeans. The resulting similarity matrix (after resorting) is actually quite reasonable, but KMeans still gives the cleanest clusters.
In [138]:
Ylabel=np.ones(len(clust_km50_s),dtype=int)*-1
Ylabel[0:10]=0
for i in range(1,15):
    Ylabel[np.where(clust_km50_s==i)[0][0:10]]=i
In [139]:
from sklearn.semi_supervised import LabelPropagation
Lab=LabelPropagation()
In [140]:
Lab=LabelPropagation(gamma=0.5)
Lab.fit(fing_pca50_s,Ylabel)
clusts_lab=Lab.predict(fing_pca50_s)
Counter(clusts_lab)
Out[140]:
In [127]:
from sklearn.metrics import adjusted_rand_score,confusion_matrix,accuracy_score
In [141]:
confusion_matrix(clust_km50_s,clusts_lab)
Out[141]:
In [142]:
adjusted_rand_score(clust_km50_s,clusts_lab)
Out[142]:
In [143]:
Counter(clusts_lab)
Out[143]:
In [144]:
Df_sorted["clust_lab"]=clusts_lab
In [133]:
Df_sorted.columns
Out[133]:
In [145]:
Df_sort_lab=Df_sorted.sort_values(by="clust_lab")
fing_lab_s=Df_sort_lab.drop(['Formula','clust_pca50','clust_lab'],axis=1).values
pca4=PCA(n_components=50)
fing_lab_pca_s=pca4.fit_transform(fing_lab_s)
dist_lab_cos_s=cosine_distances(fing_lab_pca_s)
In [149]:
plt.figure(figsize=(10,10))
plt.imshow(dist_lab_cos_s[::3,::3],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting by Label Propogation Algorithm")
plt.colorbar()
Out[149]:
Note that this in no way means KMeans is great or that label propagation is poor. We used the KMeans labels as the seed labels, so the best label propagation could do was reproduce KMeans, more or less. It is actually impressive that it gives fairly reasonable results. A higher number of seed labels would help, but this is not worth redoing since we know what we would get.
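If one did want to try more seed labels, it would only take changing the slice in the labelling loop above. A sketch with a hypothetical seed count of 50 (the names n_seed and Ylabel_more are illustrative):
In [ ]:
# Hypothetical variant: seed 50 labelled points per KMeans cluster instead of 10.
n_seed = 50
Ylabel_more = np.ones(len(clust_km50_s), dtype=int) * -1
for i in range(15):
    Ylabel_more[np.where(clust_km50_s == i)[0][0:n_seed]] = i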
In [ ]: